shallow model
Export Reviews, Discussions, Author Feedback and Meta-Reviews
"NIPS Neural Information Processing Systems 8-11th December 2014, Montreal, Canada",,, "Paper ID:","1380" "Title:","Do Deep Nets Really Need to be Deep?" First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. The authors show empirical results on TIMIT and CIFAR-10 that shallow nets trained to mimic the outputs of DNNs and CNNs achieve comparable accuracy on these tasks. The paper is clearly written and makes compelling arguments. The contribution is significant because it suggests that SNNs are capable of learning complex functions that were thought to be learnable only with DNNs or CNNs. This means that better training algorithms have yet to be devised for SNNs.
Demonstrating Multi-Suction Item Picking at Scale via Multi-Modal Learning of Pick Success
Wang, Che, van Baar, Jeroen, Mitash, Chaitanya, Li, Shuai, Randle, Dylan, Wang, Weiyao, Sontakke, Sumedh, Bekris, Kostas E., Katyal, Kapil
This work demonstrates how autonomously learning aspects of robotic operation from sparsely-labeled, real-world data of deployed, engineered solutions at industrial scale can provide with solutions that achieve improved performance. Specifically, it focuses on multi-suction robot picking and performs a comprehensive study on the application of multi-modal visual encoders for predicting the success of candidate robotic picks. Picking diverse items from unstructured piles is an important and challenging task for robot manipulation in real-world settings, such as warehouses. Methods for picking from clutter must work for an open set of items while simultaneously meeting latency constraints to achieve high throughput. The demonstrated approach utilizes multiple input modalities, such as RGB, depth and semantic segmentation, to estimate the quality of candidate multi-suction picks. The strategy is trained from real-world item picking data, with a combination of multimodal pretrain and finetune. The manuscript provides comprehensive experimental evaluation performed over a large item-picking dataset, an item-picking dataset targeted to include partial occlusions, and a package-picking dataset, which focuses on containers, such as boxes and envelopes, instead of unpackaged items. The evaluation measures performance for different item configurations, pick scenes, and object types. Ablations help to understand the effects of in-domain pretraining, the impact of different modalities and the importance of finetuning. These ablations reveal both the importance of training over multiple modalities but also the ability of models to learn during pretraining the relationship between modalities so that during finetuning and inference, only a subset of them can be used as input.
Do Deep Nets Really Need to be Deep?
Currently, deep neural networks are the state of the art on problems such as speech recognition and computer vision. In this paper we empirically demonstrate that shallow feed-forward nets can learn the complex functions previously learned by deep nets and achieve accuracies previously only achievable with deep models. Moreover, in some cases the shallow nets can learn these deep functions using the same number of parameters as the original deep models. On the TIMIT phoneme recognition and CIFAR-10 image recognition tasks, shallow nets can be trained that perform similarly to complex, well-engineered, deeper convolutional models.
On the Number of Linear Regions of Deep Neural Networks
Guido F. Montufar, Razvan Pascanu, Kyunghyun Cho, Yoshua Bengio
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
ea8fcd92d59581717e06eb187f10666d-Paper.pdf
Currently, deep neural networks are the state of the art on problems such as speech recognition and computer vision. In this paper we empirically demonstrate that shallow feed-forward nets can learn the complex functions previously learned by deep nets and achieve accuracies previously only achievable with deep models. Moreover, in some cases the shallow nets can learn these deep functions using the same number of parameters as the original deep models. On the TIMIT phoneme recognition and CIFAR-10 image recognition tasks, shallow nets can be trained that perform similarly to complex, well-engineered, deeper convolutional models.
On the Number of Linear Regions of Deep Neural Networks
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
A Single Graph Convolution Is All You Need: Efficient Grayscale Image Classification
Fein-Ashley, Jacob, Ye, Tian, Wickramasinghe, Sachini, Zhang, Bingyi, Kannan, Rajgopal, Prasanna, Viktor
Image classifiers often rely on convolutional neural networks (CNN) for their tasks, which are inherently more heavyweight than multilayer perceptrons (MLPs), which can be problematic in real-time applications. Additionally, many image classification models work on both RGB and grayscale datasets. Classifiers that operate solely on grayscale images are much less common. Grayscale image classification has diverse applications, including but not limited to medical image classification and synthetic aperture radar (SAR) automatic target recognition (ATR). Thus, we present a novel grayscale (single channel) image classification approach using a vectorized view of images. We exploit the lightweightness of MLPs by viewing images as a vector and reducing our problem setting to the grayscale image classification setting. We find that using a single graph convolutional layer batch-wise increases accuracy and reduces variance in the performance of our model. Moreover, we develop a customized accelerator on FPGA for the proposed model with several optimizations to improve its performance. Our experimental results on benchmark grayscale image datasets demonstrate the effectiveness of the proposed model, achieving vastly lower latency (up to 16$\times$ less) and competitive or leading performance compared to other state-of-the-art image classification models on various domain-specific grayscale image classification datasets.
Fast and Robust Early-Exiting Framework for Autoregressive Language Models with Synchronized Parallel Decoding
Bae, Sangmin, Ko, Jongwoo, Song, Hwanjun, Yun, Se-Young
To tackle the high inference latency exhibited by autoregressive language models, previous studies have proposed an early-exiting framework that allocates adaptive computation paths for each token based on the complexity of generating the subsequent token. However, we observed several shortcomings, including performance degradation caused by a state copying mechanism or numerous exit paths, and sensitivity to exit confidence thresholds. Consequently, we propose a Fast and Robust Early-Exiting (FREE) framework, which incorporates a shallow-deep module and a synchronized parallel decoding. Our framework enables faster inference by synchronizing the decoding process of the current token with previously stacked early-exited tokens. Furthermore, as parallel decoding allows us to observe predictions from both shallow and deep models, we present a novel adaptive threshold estimator that exploits a Beta mixture model to determine suitable confidence thresholds. We empirically demonstrated the superiority of our proposed framework on extensive generation tasks.
Abstractive Summarization as Augmentation for Document-Level Event Detection
Vidakoviฤ, Janko, Doลกiloviฤ, Filip Karlo, Pluลกฤec, Domagoj
Transformer-based models have consistently produced substantial performance gains across a variety of NLP tasks, compared to shallow models. However, deep models are orders of magnitude more computationally expensive than shallow models, especially on tasks with large sequence lengths, such as document-level event detection. In this work, we attempt to bridge the performance gap between shallow and deep models on document-level event detection by using abstractive text summarization as an augmentation method. We augment the DocEE dataset by generating abstractive summaries of examples from low-resource classes. For classification, we use linear SVM with TF-IDF representations and RoBERTa-base. We use BART for zero-shot abstractive summarization, making our augmentation setup less resource-intensive compared to supervised fine-tuning. We experiment with four decoding methods for text generation, namely beam search, top-k sampling, top-p sampling, and contrastive search. Furthermore, we investigate the impact of using document titles as additional input for classification. Our results show that using the document title offers 2.04% and 3.19% absolute improvement in macro F1-score for linear SVM and RoBERTa, respectively. Augmentation via summarization further improves the performance of linear SVM by about 0.5%, varying slightly across decoding methods. Overall, our augmentation setup yields insufficient improvements for linear SVM compared to RoBERTa.
Two Steps Forward and One Behind: Rethinking Time Series Forecasting with Deep Learning
Ughi, Riccardo, Lomurno, Eugenio, Matteucci, Matteo
The Transformer is a highly successful deep learning model that has revolutionised the world of artificial neural networks, first in natural language processing and later in computer vision. This model is based on the attention mechanism and is able to capture complex semantic relationships between a variety of patterns present in the input data. Precisely because of these characteristics, the Transformer has recently been exploited for time series forecasting problems, assuming a natural adaptability to the domain of continuous numerical series. Despite the acclaimed results in the literature, some works have raised doubts about the robustness and effectiveness of this approach. In this paper, we further investigate the effectiveness of Transformer-based models applied to the domain of time series forecasting, demonstrate their limitations, and propose a set of alternative models that are better performing and significantly less complex. In particular, we empirically show how simplifying Transformer-based forecasting models almost always leads to an improvement, reaching state of the art performance. We also propose shallow models without the attention mechanism, which compete with the overall state of the art in long time series forecasting, and demonstrate their ability to accurately predict time series over extremely long windows. From a methodological perspective, we show how it is always necessary to use a simple baseline to verify the effectiveness of proposed models, and finally, we conclude the paper with a reflection on recent research paths and the opportunity to follow trends and hypes even where it may not be necessary.